Abstract

We present an approach to learn a dense pixel-wise labeling
from image-level tags. Each image-level tag imposes
constraints on the output labeling of a Convolutional Neural
Network (CNN) classifier. We propose Constrained CNN
(CCNN), a method which uses a novel loss function to
optimize for any set of linear constraints on the output space
(i.e. predicted label distribution) of a CNN. Our loss
formulation is easy to optimize and can be incorporated directly
into standard stochastic gradient descent optimization. The
key idea is to phrase the training objective as a biconvex
optimization for linear models, which we then relax to
nonlinear deep networks. Extensive experiments demonstrate the
generality of our new learning framework. The constrained
loss yields state-of-the-art results on weakly supervised
semantic image segmentation. We further demonstrate that
adding slightly more supervision can greatly improve the
performance of the learning algorithm.


1. Introduction 

In recent years, standard computer vision tasks, such
as recognition or classification, have made tremendous
progress. This is primarily due to the widespread adoption
of Convolutional Neural Networks (CNNs) [11, 19, 20].
Existing models excel by their capacity to take advantage
of massive amounts of fully supervised training data [28].
This reliance on full supervision is a major limitation on
scalability with respect to the number of classes or tasks.
For structured prediction problems, such as semantic
segmentation, fully supervised, i.e. pixel-level, labels are both
expensive and time consuming to obtain. Summarization
of the semantic labels in terms of weak supervision, e.g.
image-level tags or bounding box annotations, is often less
costly. Leveraging the full potential of these weak annotations
is challenging, and existing approaches are susceptible
to diverging into bad local optima from which recovery is
difficult [6, 16, 25].

The implementation code and trained models are available at the
author's website.

Figure 1: We train convolutional neural networks from a set
of linear constraints on the output variables. The network
output is encouraged to follow a latent probability
distribution, which lies in the constraint manifold. The resulting
loss is easy to optimize and can incorporate arbitrary linear
constraints.

In this paper, we present a framework to incorporate
weak supervision into the learning procedure through a
series of linear constraints. In general, it is easier to express
simple constraints on the output space than to craft
regularizers or ad hoc training procedures to guide the learning.
In semantic segmentation, such constraints can describe the
existence and expected distribution of labels from
image-level tags. For example, given that a car is present in an
image, a certain number of pixels should be labeled as car.

We propose Constrained CNN (CCNN), a method which
uses a novel loss function to optimize convolutional
networks with arbitrary linear constraints on the structured
output space of pixel labels. The non-convex nature of deep
nets makes a direct optimization of the constraints
difficult. Our key insight is to model a distribution over
latent "ground truth" labels while the output of the deep net
follows this latent distribution as closely as possible. This
allows us to enforce the constraints on the latent
distribution instead of the network output, which greatly simplifies
the resulting optimization problem. The resulting objective
is a biconvex problem for linear models. For deep
nonlinear models, it results in an alternating convex and
gradient-based optimization which can be naturally integrated into
standard stochastic gradient descent (SGD). As illustrated
in Figure 1, after each iteration the output is pulled towards
the closest point on the constrained manifold of plausible
semantic segmentation. Our Constrained CNN is guided by
weak annotations and trained end-to-end.

We evaluate CCNN on the problem of multi-class
semantic segmentation with varying levels of weak
supervision defined by different linear constraints. Our approach
achieves state-of-the-art performance on Pascal VOC 2012
compared to other weak learning approaches. It does not
require pixel-level labels for any objects during training
time, but infers them directly from the image-level tags. We
show that our constrained optimization framework can
incorporate additional forms of weak supervision, such as a
rough estimate of the size of an object. The proposed
technique is general, and can incorporate many forms of weak
supervision.

2. Related Work 

Weakly supervised learning seeks to capture the signal
that is common to all the positives but absent from all the
negatives. This is challenging due to nuisance variables
such as pose, occlusion, and intra-class variation.
Learning with weak labels is often phrased as Multiple Instance
Learning [8]. It is most frequently formulated as a
maximum margin problem, although boosting [1, 36] and
Noisy-OR models [15] have been explored as well. The multiple
instance max-margin classification problem is non-convex
and solved as an alternating minimization of a biconvex
objective [2]. MI-SVM [2] and LSVM [10] are two classic
methods in this paradigm. This setting naturally applies to
weakly-labeled detection [31, 34]. However, most of these
approaches are sensitive to the initialization of the
detector [6]. Several heuristics have been proposed to address
these issues [30, 31]; however, they are usually specific to
detection.

Traditionally, the problem of weak segmentation and
scene parsing with image-level labels has been addressed
using graphical models and parametric structured
models [32, 33, 37]. Most works exploit low-level image
information to connect regions similar in appearance [32]. Chen
et al. [5] exploit top-down segmentation priors based on
visual subcategories for object discovery. Pinheiro et al. [26]
and Pathak et al. [25] extend the multiple-instance learning
framework from detection to semantic segmentation using
CNNs. Their methods iteratively reinforce well-predicted
outputs while suppressing erroneous segmentations
contradicting image-level tags. Both algorithms are very sensitive
to the initialization, and rely on carefully pretrained
classifiers for all layers in the convolutional network. In contrast,
our constrained optimization is much less sensitive and
recovers a good solution from any random initialization of the
classification layer.

Papandreou et al. [24] include an adaptive bias into the
multi-instance learning framework. Their algorithm boosts
classes known to be present and suppresses all others. We
show that this simple heuristic can be viewed as a special
case of a constrained optimization, where the adaptive bias
controls the constraint satisfaction. However, the constraints
that can be modeled by this adaptive bias are limited and
cannot leverage the full power of weak labels. In this paper,
we show how to apply more general linear constraints which
lead to better segmentation performance.

Constrained optimization problems have long been
approximated by artificial neural networks [35]. These models
are usually non-parametric, and solve just a single instance
of a linear program. Platt et al. [27] show how to optimize
equality constraints on the output of a neural network.
However, the resulting objective is highly non-convex, which
makes a direct minimization hard. In this paper, we show
how to optimize a constrained objective by alternating
between a convex and a gradient-based optimization.

The resulting algorithm is similar to generalized
expectation [22] and posterior regularization [12] in natural
language processing. Both methods train a parametric model
that matches certain expectation constraints by applying a
penalty to the objective function. Generalized expectation
adds the expected constraint penalty directly to the objective,
which for convolutional networks is hard and expensive to
evaluate directly. Ganchev et al. [12] constrain an
auxiliary variable, yielding an algorithm similar to our objective
in dual space.


3. Preliminaries 

Figure 2: Overview of our weak learning pipeline. The
input image is passed through a fully convolutional network
(FCN) which produces an output labeling. The model is
trained such that the output labeling follows a set of simple
linear constraints imposed by image-level tags.

We define a pixel-wise labeling for an image $I$ as a set of
random variables $X = \{x_1, \ldots, x_n\}$, where $n$ is the
number of pixels in an image. Each $x_i \in \mathcal{L}$ takes
one of $m$ discrete labels from the label set $\mathcal{L}$. A
CNN models a probability distribution $Q(X \mid \theta, I)$ over
those random variables, where $\theta$ are the parameters of the
network. The distribution is commonly modeled as a product of
independent marginals
$Q(X \mid \theta, I) = \prod_i q_i(x_i \mid \theta, I)$, where
each of the marginals represents a softmax probability:

$$q_i(x_i \mid \theta, I) = \frac{1}{Z_i} \exp\big(f_i(x_i; \theta, I)\big) \tag{1}$$

where $Z_i = \sum_{l \in \mathcal{L}} \exp\big(f_i(l; \theta, I)\big)$
is the partition function of a pixel $i$. The function $f_i$
represents the real-valued score of the neural network. A higher
score corresponds to a higher likelihood.
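
As a small illustration of Equation (1), the sketch below computes the per-pixel softmax marginals from a score array with NumPy. The function name `softmax_marginals`, the `(n, m)` array layout and the toy scores are assumptions made for this example and are not taken from the paper's implementation.

```python
import numpy as np

def softmax_marginals(scores):
    """Per-pixel softmax q_i(l) = exp(f_i(l)) / Z_i.

    scores: array of shape (n, m) holding f_i(l) for n pixels and m labels
    (this layout is an assumption of the sketch).
    """
    # Subtracting the row maximum leaves the softmax unchanged
    # but avoids overflow in exp.
    shifted = scores - scores.max(axis=1, keepdims=True)
    unnormalized = np.exp(shifted)
    partition = unnormalized.sum(axis=1, keepdims=True)  # Z_i of the shifted scores
    return unnormalized / partition

# Toy scores for 4 pixels and 3 labels (made-up values).
f = np.array([[2.0, 0.5, -1.0],
              [0.0, 0.0, 0.0],
              [-3.0, 1.0, 1.0],
              [5.0, 4.0, 3.0]])
q = softmax_marginals(f)
```

Each row of `q` is one marginal $q_i$; a higher score in `f` yields a higher probability in the same position of `q`.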

Standard learning algorithms aim to maximize the
likelihood of the observed training data under the model. This
requires full knowledge of the ground truth labeling, which
is not available in the weakly supervised setting. In the next
section, we show how to optimize the parameters of a CNN
using some high-level constraints on the distribution of the
output labeling. An overview of this is given in Figure 2. In
Section 5, we then present a few examples of useful
constraints for weak labeling.

4. Constrained Optimization 

For notational convenience, let $Q_I$ be the vectorized
form of the network output $Q(X \mid \theta, I)$. The Constrained
CNN (CCNN) optimization can be framed as:

$$\begin{aligned}
&\text{find} && \theta \\
&\text{subject to} && A_I Q_I \geq b_I \quad \forall I,
\end{aligned} \tag{2}$$

where $A_I \in \mathbb{R}^{k \times nm}$ and $b_I \in \mathbb{R}^k$
enforce $k$ individual linear constraints on the output
distribution of the convnet on image $I$. In theory, many outputs
$Q_I$ satisfy these constraints. However, all network outputs are
parametrized by a single parameter vector $\theta$, which ties the
output space of different $Q_I$ together. In practice, this leads
to an output that is both consistent with the input image and the
constraints imposed by the weak labels.

For notational simplicity, we derive our inference
algorithm for a single image with $A = A_I$, $b = b_I$ and $Q = Q_I$.
The entire derivation generalizes to an arbitrary number of
images and constraints. Constraints include, for example,
lower and upper bounds on the expected number of
foreground and background pixel labels in a scene. For more
examples, see Section 5. In the first part of this section,
we assume that all constraints are satisfiable, meaning there
always exists a parameter setting $\theta$ such that $AQ \geq b$.
In Section 4.3, we lift this assumption by adding slack
variables to each of the constraints.

While problem (2) is convex in the network output $Q$, it
is generally not convex with respect to the network
parameters $\theta$. For any non-linear function $Q$, the matrix
$A$ can be chosen such that the constraint is an upper or lower
bound to $Q$, one of which is non-convex. This makes a direct
optimization hard. As a matter of fact, not even log-linear
models, such as logistic regression, can be directly optimized
under this objective. Alternatively, one could optimize the
Lagrangian dual of (2). However, this is computationally
very expensive, as we would need to optimize an entire
convolutional neural network in an inner loop of a dual descent
algorithm.

In order to efficiently optimize problem (2), we introduce
a latent probability distribution $P(X)$ over the semantic
labels $X$. We constrain $P(X)$ to lie in the feasibility region
of the constrained objective while removing the constraints
on the network output $Q$. We then encourage $P$ and $Q$ to
model the same probability distribution by minimizing their
respective KL-divergence. The resulting problem is defined
as

$$\begin{aligned}
&\underset{\theta, P}{\text{minimize}} && D\big(P(X) \,\|\, Q(X \mid \theta)\big) \\
&\text{subject to} && AP \geq b, \quad \sum_X P(X) = 1,
\end{aligned} \tag{3}$$

where $D\big(P(X) \,\|\, Q(X \mid \theta)\big) = \sum_X P(X) \log P(X)
- \mathbb{E}_{X \sim P}\big[\log Q(X \mid \theta)\big]$ and $P$ is
the vectorized version of $P(X)$. If the constraints in (2) are
satisfiable then the problems (2) and (3) are equivalent, with a
solution of (3) at $P$ that is equal to the feasible $Q$. This
equality implies that $P(X)$ can be modeled as a product of
independent marginals $P(X) = \prod_i p_i(x_i)$ without loss of
generality, with a minimum at $p_i(x_i) = q_i(x_i \mid \theta)$. A
detailed proof is provided in the supplementary material.

The new objective is much easier to optimize, as it
decouples the constraints from the network output. For fixed
network parameters $\theta$, the problem is convex in $P$. For a
fixed latent distribution $P$, the problem reduces to a
standard cross entropy loss which is optimized using stochastic
gradient descent.
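
The two cases can be read off from the definition of the KL-divergence. The short derivation below is a sketch: the symbol $H(P)$ for the entropy of $P$ and the shorthand $Q_\theta$ are introduced here for illustration only.

```latex
\begin{aligned}
D\big(P \,\|\, Q_\theta\big)
  &= \sum_X P(X)\log P(X) \;-\; \sum_X P(X)\log Q(X\mid\theta) \\
  &= -H(P) \;-\; \mathbb{E}_{X\sim P}\big[\log Q(X\mid\theta)\big].
\end{aligned}
% theta fixed: -H(P) is convex in P and the expectation is linear in P,
%   so minimizing over P under linear constraints is a convex problem.
% P fixed: -H(P) is a constant, which leaves the cross entropy
%   \min_\theta \; -\sum_i \sum_{x_i} p_i(x_i)\,\log q_i(x_i\mid\theta).
```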

In the remainder of this section, we derive an algorithm
to optimize problem (3) using block coordinate descent.
Section 4.1 solves the constrained optimization for $P$ while
keeping the network parameters $\theta$ fixed. Section 4.2 then
incorporates this optimization into standard stochastic
gradient descent, keeping $P$ fixed while optimizing for $\theta$.
Each step is guaranteed to decrease the overall energy of problem
(3), converging to a good local optimum. At the end of
this section, we show how to handle constraints that are not
directly satisfiable by adding a slack variable to the loss.

4.1. Latent distribution optimization 

We first show how to optimize problem (3) with respect
to $P$ while keeping the convnet output fixed. The objective
function is convex with linear constraints, which implies
Slater's condition, and hence strong duality holds as long as
the constraints are satisfiable [3]. We can therefore optimize
problem (3) by maximizing its dual function, i.e.,

$$L(\lambda) = \lambda^\top b - \sum_{i=1}^{n} \log \sum_{l \in \mathcal{L}} \exp\big(f_i(l; \theta) + A_{i;l}^\top \lambda\big), \tag{4}$$

where $\lambda \geq 0$ are the dual variables pertaining to the
inequality constraints and $f_i(l; \theta)$ is the score of the
convnet classifier for pixel $i$ and label $l$. $A_{i;l}$ is the
column of $A$ corresponding to $p_i(l)$. A detailed derivation of
this dual function is provided in the supplementary material.

The dual function is concave and can be optimized
globally using projected gradient ascent [3]. The gradient of the
dual function is given by
$\frac{\partial}{\partial \lambda} L(\lambda) = b - AP$, which
results in

$$p_i(x_i) = \frac{1}{Z_i} \exp\big(f_i(x_i; \theta) + A_{i;x_i}^\top \lambda\big),$$

where $Z_i = \sum_{l \in \mathcal{L}} \exp\big(f_i(l; \theta) + A_{i;l}^\top \lambda\big)$
is the local partition function ensuring that the distribution
$p_i(x_i)$ sums to one over all $x_i \in \mathcal{L}$. Intuitively,
the projected gradient ascent algorithm increases the dual
variables for all constraints that are not satisfied. Those dual
variables in turn adjust the distribution $p_i$ to fulfill the
constraints. The projected dual gradient ascent algorithm usually
converges within fewer than 50 iterations, making the
optimization highly efficient.
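
The projected gradient ascent on the dual can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the constraint matrix is stored as an array `A` of shape `(k, n, m)` (one coefficient per constraint, pixel and label), and the step size, iteration count and toy problem are chosen only for this example.

```python
import numpy as np

def latent_distribution(scores, A, lam):
    """p_i(l) proportional to exp(f_i(l) + sum_j lam[j] * A[j, i, l])."""
    logits = scores + np.tensordot(lam, A, axes=1)      # shape (n, m)
    logits = logits - logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def solve_dual(scores, A, b, step=0.1, iters=200):
    """Projected gradient ascent on the dual variables lam >= 0."""
    lam = np.zeros(len(b))
    for _ in range(iters):
        P = latent_distribution(scores, A, lam)
        AP = np.tensordot(A, P, axes=2)                 # shape (k,)
        # The dual gradient is b - AP; projecting onto lam >= 0 keeps
        # the multipliers of satisfied constraints at zero.
        lam = np.maximum(0.0, lam + step * (b - AP))
    return latent_distribution(scores, A, lam), lam

# Toy problem: 10 pixels, labels {0: background, 1: foreground}.
# The scores prefer background, but we require at least 3 expected
# foreground pixels: sum_i p_i(1) >= 3.
n, m = 10, 2
scores = np.tile([2.0, 0.0], (n, 1))
A = np.zeros((1, n, m))
A[0, :, 1] = 1.0
b = np.array([3.0])
P, lam = solve_dual(scores, A, b)
```

In this toy problem the unconstrained softmax puts roughly 1.2 expected pixels on the foreground label, so the single dual variable grows until the latent distribution places exactly 3 expected pixels there.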

Next, we show how to incorporate this estimate of P{X) 
into the standard stochastic gradient descent algorithm. 

4.2. SGD 

For a fixed latent distribution $P$, problem (3) reduces to
the standard cross entropy loss

$$L(\theta) = -\sum_i \sum_{x_i} p_i(x_i) \log q_i(x_i \mid \theta). \tag{5}$$

The gradient of this loss function with respect to the network
scores is given by
$\frac{\partial}{\partial f_i(x_i)} L(\theta) = q_i(x_i \mid \theta) - p_i(x_i)$.
For linear models, the loss function (5) is convex and can be
optimized using any gradient-based optimization. For multi-layer
deep networks, we optimize it using back-propagation and
stochastic gradient descent (SGD) with momentum, as
implemented in Caffe [17].
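
The gradient expression can be checked numerically. The sketch below compares $q - p$ against central finite differences of the cross entropy on random scores; the helper names and the random test data are assumptions of this example.

```python
import numpy as np

def log_softmax(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_entropy(scores, P):
    """L = -sum_i sum_l p_i(l) * log q_i(l)."""
    return -(P * log_softmax(scores)).sum()

def cross_entropy_grad(scores, P):
    """Analytic gradient with respect to the scores: q - p."""
    return np.exp(log_softmax(scores)) - P

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 3))          # 5 pixels, 3 labels
P = rng.dirichlet(np.ones(3), size=5)     # each row sums to one

# Central finite differences, one score entry at a time.
eps = 1e-6
numeric = np.zeros_like(scores)
for idx in np.ndindex(*scores.shape):
    bump = np.zeros_like(scores)
    bump[idx] = eps
    numeric[idx] = (cross_entropy(scores + bump, P)
                    - cross_entropy(scores - bump, P)) / (2 * eps)
analytic = cross_entropy_grad(scores, P)
```

The identity relies on every row of `P` summing to one, which holds for the latent distribution because it is normalized per pixel.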

Theoretically, we would need to keep the latent
distribution $P$ fixed for a few iterations of SGD until the
objective value decreases. Otherwise, we are not strictly
guaranteed that the overall objective (3) decreases. However, in
practice we found that inferring a new latent distribution at every
step of SGD does not hurt the performance and leads to
faster convergence.

Figure 3: Illustration of our alternating convex optimization
and gradient descent optimization for $t = 0$. At each
iteration $t$, we compute a latent probability distribution
$P^{(t)}$ as the closest point in the constrained region. We then
update the convnet parameters to follow $P^{(t)}$ as closely as
possible using Stochastic Gradient Descent (SGD), which takes the
convnet output from $Q^{(t)}$ to $Q^{(t+1)}$.

In summary, we optimize problem (3) using SGD,
where at each iteration we infer a latent distribution $P$
which defines both our loss and loss gradient. Figure 3
shows an overview of the training procedure. For more
details, see Section 6.
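
The alternating procedure summarized above can be sketched on a toy linear model. Everything specific in this example is an assumption made for illustration: a linear scorer `X @ W` stands in for the convnet, plain gradient descent replaces SGD with momentum, and the features, step sizes and iteration counts are arbitrary.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def infer_latent(scores, A, b, step=0.1, iters=50):
    """Convex step: projected dual gradient ascent with the scores fixed."""
    lam = np.zeros(len(b))
    for _ in range(iters):
        P = softmax(scores + np.tensordot(lam, A, axes=1))
        lam = np.maximum(0.0, lam + step * (b - np.tensordot(A, P, axes=2)))
    return softmax(scores + np.tensordot(lam, A, axes=1))

def train(X, A, b, num_labels, lr=0.05, steps=200):
    """Gradient step: pull the model output towards the latent distribution."""
    W = np.zeros((X.shape[1], num_labels))
    for _ in range(steps):
        scores = X @ W
        P = infer_latent(scores, A, b)       # new latent distribution every step
        grad_scores = softmax(scores) - P    # cross entropy gradient q - p
        W -= lr * (X.T @ grad_scores)
    return W

# Toy data: 20 "pixels" with a bias feature and one scalar feature.
n = 20
X = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])
# Single constraint: at least 80% of the pixels are foreground (label 1).
A = np.zeros((1, n, 2))
A[0, :, 1] = 1.0
b = np.array([0.8 * n])
W = train(X, A, b, num_labels=2)
Q = softmax(X @ W)
```

With zero initial weights the model assigns 10 expected foreground pixels. After training, the expected foreground count of `Q` approaches the bound of 16, even though the model output was never constrained directly.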

Up to this point, we assumed that all the constraints are
simultaneously satisfiable. While this might hold for
carefully chosen constraints, our optimization should be robust
to arbitrary linear constraints. In the next section, we relax
this assumption by adding a slack variable to the constraint
set and show that this slack variable can then be easily
integrated into the optimization.

4.3. Constraints with slack variable 

We relax problem (3) by adding a slack $\xi \in \mathbb{R}^k$ to
the linear constraints. The slack is regularized using a hinge
loss with weight $\beta \in \mathbb{R}^k$. This results in the
following optimization:

$$\begin{aligned}
&\underset{\theta, P, \xi}{\text{minimize}} && D\big(P(X) \,\|\, Q(X \mid \theta)\big) + \beta^\top \xi \\
&\text{subject to} && AP \geq b - \xi, \quad \sum_X P(X) = 1, \quad \xi \geq 0.
\end{aligned} \tag{6}$$

This objective is now guaranteed to be satisfiable for any
assignment to $P$ and any linear constraint. Similar to (4), this
is optimized using projected dual coordinate ascent. The
dual objective function is exactly the same as (4). The
weighting term of the hinge loss $\beta$ merely acts as an upper
bound on the dual variable, i.e. $0 \leq \lambda \leq \beta$. A
detailed derivation of this loss is given in the supplementary
material.

This slack relaxed loss allows the optimization to ignore 
certain constraints if they become too hard to enforce. It 
also trades off between various competing constraints. 
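
In a projected gradient solver, the slack relaxation only changes the projection step: the dual variables are clipped to the box between zero and the hinge weights. The sketch below illustrates this on two deliberately conflicting constraints. The `(k, n, m)` constraint layout, the hinge weights and the toy problem are assumptions of this example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def solve_dual_with_slack(scores, A, b, beta, step=0.1, iters=500):
    """Projected gradient ascent with 0 <= lam <= beta."""
    lam = np.zeros(len(b))
    for _ in range(iters):
        P = softmax(scores + np.tensordot(lam, A, axes=1))
        grad = b - np.tensordot(A, P, axes=2)
        # Without slack the projection would be max(0, .); the hinge
        # weight beta adds an upper bound on every dual variable.
        lam = np.clip(lam + step * grad, 0.0, beta)
    return softmax(scores + np.tensordot(lam, A, axes=1)), lam

# Two conflicting constraints on 10 pixels with uninformative scores:
#   sum_i p_i(1) >= 8   and   sum_i p_i(1) <= 2  (written as -sum >= -2).
n = 10
scores = np.zeros((n, 2))
A = np.zeros((2, n, 2))
A[0, :, 1] = 1.0
A[1, :, 1] = -1.0
b = np.array([8.0, -2.0])
beta = np.array([3.0, 1.0])   # the first constraint is weighted more heavily
P, lam = solve_dual_with_slack(scores, A, b, beta)
```

Both constraints cannot hold at once, so without the upper bound the dual variables would grow without limit. With the bound, the second dual variable saturates at its weight of 1 and is effectively given up, while the more heavily weighted first constraint ends up satisfied.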

5. Constraints for Weak Semantic Segmentation

We now describe all constraints we use for our weakly
supervised semantic segmentation. For each training image
$I$, we are given a set of image-level labels $\mathcal{L}_I$. Our
constraints affect different parts of the output space depending
on the image-level labels. All the constraints are
complementary, and each constraint exploits the set of image-level
labels differently.

Suppression constraint The most natural constraint is to
suppress any label $l$ that does not appear in the image.

$$\sum_{i=1}^{n} p_i(l) \leq 0 \quad \forall l \notin \mathcal{L}_I. \tag{7}$$

This constraint alone is not sufficient, as a solution
involving all background labels satisfies it perfectly. We can easily
address this by adding a lower-bound constraint for labels
present in an image.

Foreground constraint

$$a_l \leq \sum_{i=1}^{n} p_i(l) \quad \forall l \in \mathcal{L}_I. \tag{8}$$

This foreground constraint is very similar to the commonly
used multiple instance learning (MIL) paradigm, where at
least one pixel is constrained to be positive [2, 16, 25, 26].
Unlike MIL, our foreground constraint can encourage
multiple pixels to take a specific foreground label by increasing
$a_l$. In practice, we set $a_l = 0.05n$ with a slack of
$\beta = 2$, where $n$ is the number of outputs of our network.

While this foreground constraint encourages some of the
pixels to take a specific label, it is often not strong enough
to encourage all pixels within an object to take the
correct label. We could increase $a_l$ to encourage more
foreground labels, but this would over-emphasize small objects.
A more natural solution is to constrain the total number of
foreground labels in the output, which is equivalent to
constraining the overall area of the background label.


Background constraint

$$a_0 \leq \sum_{i=1}^{n} p_i(0) \leq b_0. \tag{9}$$

Here $l = 0$ is assumed to be the background label. We apply
both a lower and upper bound on the background label. This
indirectly controls the minimum and maximum combined
area of all foreground labels. We found $a_0 = 0.3n$ and
$b_0 = 0.7n$ to work well in practice.

The above constraints are all complementary and ensure
that the final labeling follows the image-level labels
$\mathcal{L}_I$ as closely as possible. If we also have access to
the rough size of an object, we can exploit this information during
training. In our experiments, we show that substantial gains
can be made by simply knowing if a certain object class
covers more or less than 10% of the image.

Size constraint We exploit the size constraint in two
ways: We boost all classes larger than 10% of the image by
setting $a_l = 0.1n$. We also put an upper bound constraint
on the classes $l$ that are guaranteed to be small

$$\sum_{i=1}^{n} p_i(l) \leq b_l. \tag{10}$$

In practice, a threshold $b_l < 0.01n$ works slightly better
than a tight threshold.
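
Constraints (7) to (10) are all linear in the latent distribution, so they can be stacked into a single system of the form $AP \geq b$. The sketch below shows one way to assemble it. The function `build_constraints`, its arguments and the `(k, n, m)` array layout are hypothetical choices for this example; only the bound values 0.05n, 0.1n, 0.3n and 0.7n come from the text, and the upper bound for small classes must be supplied by the caller.

```python
import numpy as np

def build_constraints(n, m, present, large=(), small=(), small_bound=None):
    """Stack the constraints as A (k, n, m) and b (k,) such that
    feasibility reads  sum_{i,l} A[j, i, l] * p_i(l) >= b[j].

    present: foreground labels given by the image-level tags.
    large / small: optional size hints; small_bound must be set
    whenever small is non-empty.
    """
    rows, bounds = [], []

    def add(label, sign, bound):
        # sign=+1 encodes a lower bound, sign=-1 an upper bound.
        row = np.zeros((n, m))
        row[:, label] = sign
        rows.append(row)
        bounds.append(sign * bound)

    for l in range(1, m):                      # label 0 is the background
        if l not in present:
            add(l, -1.0, 0.0)                  # Eq. (7): sum_i p_i(l) <= 0
        else:
            a_l = 0.1 * n if l in large else 0.05 * n
            add(l, +1.0, a_l)                  # Eq. (8): sum_i p_i(l) >= a_l
            if l in small:
                add(l, -1.0, small_bound)      # Eq. (10): sum_i p_i(l) <= b_l
    add(0, +1.0, 0.3 * n)                      # Eq. (9): background lower bound
    add(0, -1.0, 0.7 * n)                      # Eq. (9): background upper bound
    return np.stack(rows), np.array(bounds)

# Example: 100 outputs, 4 labels, image tagged with labels 1 and 3.
A, b = build_constraints(n=100, m=4, present={1, 3})
```

Upper bounds are negated so that every row has the same "greater than or equal" form, which is what a projected dual ascent solver for $AP \geq b$ expects.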

The EM-Adapt algorithm of Papandreou et al. [24] can
be seen as a special case of a constrained optimization
problem with just suppression and foreground constraints. The
adaptive bias parameters then correspond to the Lagrangian
dual variables $\lambda$ of our constrained optimization.
However, in the original algorithm of Papandreou et al., the
constraints are not strictly enforced, especially when some of
them conflict. In Section 7, we show that a principled
optimization of those constraints, CCNN, leads to a substantial
increase in performance.

6. Implementation Details 

In this section, we discuss the overall pipeline of our
algorithm applied to semantic image segmentation. We
consider the weakly supervised setting, i.e. only image-level
labels are present during training. At test time, the task is to
predict a semantic segmentation mask for a given image.

Method                   bgnd   aero   bike   bird   boat bottle    bus    car    cat  chair    cow  table    dog  horse  mbike person  plant  sheep   sofa  train     tv   mIoU
MIL-FCN [25]                -      -      -      -      -      -      -      -      -      -      -      -      -      -      -      -      -      -      -      -      -   24.9
MIL-Base [26]            37.0   10.4   12.4   10.8    5.3    5.7   25.2   21.1   25.2    4.8   21.5    8.6   29.1   25.1   23.6   25.5   12.0   28.4    8.9   22.0   11.6   17.8
MIL-Base w/ ILP [26]     73.2   25.4   18.2   22.7   21.5   28.6   39.5   44.7   46.6   11.9   40.4   11.8   45.6   40.1   35.5   35.2   20.8   41.7   17.0   34.7   30.4   32.6
EM-Adapt w/o CRF [24]    65.3   28.2   16.9   27.4   21.1   28.1   45.4   40.5   42.3   13.2   32.1   23.3   38.7   32.0   39.9   31.3   22.7   34.2   22.8   37.0   30.0   32.0
EM-Adapt [24]            67.2   29.2   17.6   28.6   22.2   29.6   47.0   44.0   44.2   14.6   35.1   24.9   41.0   34.8   41.6   32.1   24.8   37.4   24.0   38.1   31.6   33.8
CCNN w/o CRF             66.3   24.6   17.2   24.3   19.5   34.4   45.6   44.3   44.7   14.4   33.8   21.4   40.8   31.6   42.8   39.1   28.8   33.2   21.5   37.4   34.4   33.3
CCNN                     68.5   25.5   18.0   25.4   20.2   36.3   46.8   47.1   48.0   15.8   37.9   21.0   44.5   34.5   46.2   40.7   30.4   36.3   22.2   38.8   36.9   35.3

Table 1: Comparison of weakly supervised semantic segmentation methods on PASCAL VOC 2012 validation set.

Learning The CNN architecture used in our experiments
is derived from the VGG 16-layer network [29]. It was
pre-trained on the ImageNet 1K-class dataset, and achieved
winning performance on ILSVRC14. We cast the fully
connected layers into convolutions in a similar fashion as
suggested in [21], and the last fc8 layer with 1K outputs is
replaced by one containing 21 outputs corresponding to the 20
object classes in Pascal VOC and the background class. The
overall network stride of this fully convolutional network
is 32s. However, we observe that the slightly modified
architecture with the denser 8s network stride proposed in [4]
gives better results in the weakly supervised training.
Unlike [25, 26], we do not learn any weights of the last layer
from ImageNet. Apart from the initial pre-training, all
parameters are fine-tuned only on Pascal VOC. We initialize
the weights of the last layer with random Gaussian noise.

The FCN takes in arbitrarily sized images and produces
coarse heatmaps corresponding to each class in the dataset.
We apply the convex constrained optimization on these
coarse heatmaps, reducing the computational cost. The
network is trained using SGD with momentum. We follow [21]
and train our models with a batch size of 1, momentum
of 0.99 and an initial learning rate of 1e-6. We train for
60000 iterations, which corresponds to roughly 5 epochs.
The learning rate is decreased by a factor of 0.1 every 20000
iterations. We found this setup to outperform a batch size
of 20 with momentum of 0.9 [4]. The constrained
optimization for a single image takes less than 30 ms on a single
CPU core, and could be accelerated using a GPU. The total
training time is 8-9 hours, comparable to [21, 24].
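
The learning rate schedule described above is a step decay. A minimal sketch follows; the function name and its default arguments are chosen for this example and simply restate the hyperparameters from the text.

```python
def learning_rate(iteration, base_lr=1e-6, gamma=0.1, step_size=20000):
    """Step decay: multiply the rate by gamma every step_size iterations."""
    return base_lr * gamma ** (iteration // step_size)

# Rates at the start of each of the three 20000-iteration stages.
schedule = [learning_rate(it) for it in (0, 20000, 40000)]
```

Over the 60000 training iterations the rate therefore takes the values 1e-6, 1e-7 and 1e-8, up to floating point rounding.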

Inference At inference time, we optionally apply a fully
connected conditional random field model [18] to refine the
final segmentation. We used the default parameters provided
by the authors for all our experiments.

7. Experiments 

We analyze and compare the performance of our
constrained optimization for varying levels of supervision:
image-level tags and additional supervision such as object
size information. The objective is to learn models to predict
dense multi-class semantic segmentation, i.e. a pixel-wise
labeling for any new image. We use the provided
supervision with a few simple spatial constraints on the output, and
do not use any additional low-level graph-cut based
methods in training. The goal is to demonstrate the strength of
training with constrained outputs, and how it helps with
increasing levels of supervision.


7.1. Dataset 

We evaluate CCNNs for the task of semantic image
segmentation on the PASCAL VOC dataset [9]. The dataset
contains pixel-level labels for 20 object classes and a separate
background class. For a fair comparison to prior work, we
use a similar setup to train all models. Training is
performed on the union of the VOC 2012 train set and the larger
dataset collected by Hariharan et al. [13], summing up to a
total of 10,582 training images. The VOC12 validation set
containing a total of 1449 images is kept held-out during
ablation studies. The VGG network architecture used in our
algorithm was pre-trained on the ILSVRC dataset [28] for the
classification task of 1K classes [29].

Results are reported in the form of the standard intersection
over union (IoU) metric, also known as the Jaccard Index. It is
defined per class as the percentage of pixels predicted
correctly out of total pixels labeled or classified as that class.
Ablation studies and comparison with baseline methods for
both the weak settings are presented in the following
subsections.
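
The per-class metric can be sketched as follows. This is an illustrative computation on a single toy label map, under the assumption that all pixels are labeled; the official PASCAL VOC evaluation accumulates counts over the whole dataset and skips pixels marked as "void", which this sketch does not handle.

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """IoU per class: |pred == c AND gt == c| / |pred == c OR gt == c|."""
    ious = np.full(num_classes, np.nan)   # NaN marks classes absent from both maps
    for c in range(num_classes):
        pred_c, gt_c = pred == c, gt == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union > 0:
            ious[c] = np.logical_and(pred_c, gt_c).sum() / union
    return ious

# Toy 2 x 3 label maps with three classes (made-up values).
pred = np.array([[0, 0, 1],
                 [1, 1, 2]])
gt = np.array([[0, 1, 1],
               [1, 1, 1]])
ious = per_class_iou(pred, gt, num_classes=3)
mean_iou = np.nanmean(ious)
```

Here class 0 overlaps in one of two pixels, class 1 in three of five, and class 2 is predicted but never present in the ground truth, so its IoU is zero.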

7.2. Training from image-level tags 

We start by training our model using just image-level
tags. We obtain these tags from the presence of a class
in the pixel-wise ground truth segmentation masks. The
constraints used in this setting are described in
Equations (7), (8) and (9). Since some of the baseline methods
report results on the VOC12 validation set, we present the
performance on both validation and test set. Some methods
boost their performance by using a Dense CRF model [18]
to post-process the final output labeling. To allow for a
fair comparison, we present results both with and without
a Dense CRF.

Table 1 compares all contemporary weak segmentation
methods. Our proposed method, CCNN, outperforms all
prior methods for weakly labeled semantic segmentation
by a significant margin. MIL-FCN [25] is an extension of
learning based on maximum scoring instance based MIL
to multi-class segmentation. The algorithm proposed by
Pinheiro et al. [26] introduces a soft version of MIL. It is
trained on 0.7 million images for 21 classes taken from
ILSVRC13, which is 70 times more data than all other
approaches used. They achieve a boost in performance by
re-ranking the pixel probabilities with the image-level priors
(ILP), i.e. the probability of a class to be present in the image.
This suppresses the negative classes and smooths out the
predicted segmentation mask. For the EM-Adapt [24]
algorithm, we reproduced the models using their publicly
available implementation.¹ We apply a similar set of constraints
on EM-Adapt to make sure it is purely a comparison of the
approach. Note that unconstrained MIL-based approaches
require the final 21-class classifier to be well initialized for
reasonable performance, while our constrained
optimization can handle arbitrary random initializations.

Method                       aero   bike   bird   boat bottle    bus    car    cat  chair    cow  table    dog  horse  mbike person  plant  sheep   sofa  train     tv   mIoU
Fully Supervised:
SDS [14]                     63.3   25.7   63.0   39.8   59.2   70.9   61.4   54.9   16.8   45.0   48.2   50.5   51.0   57.7   63.3   31.8   58.7   31.2   55.7   48.5   51.6
FCN-8s [21]                  76.8   34.2   68.9   49.4   60.3   75.3   74.7   77.6   21.4   62.5   46.8   71.8   63.9   76.5   73.9   45.2   72.4   37.4   70.9   55.1   62.2
TTIC Zoomout [23]            81.9   35.1   78.2   57.4   56.5   80.5   74.0   79.8   22.4   69.6   53.7   74.0   76.0   76.6   68.8   44.3   70.2   40.2   68.9   55.3   64.4
DeepLab-CRF [4]              78.4   33.1   78.2   55.6   65.3   81.3   75.5   78.6   25.3   69.2   52.7   75.2   69.0   79.1   77.6   54.7   78.3   45.1   73.3   56.2   66.4
Weakly Supervised:
CCNN w/ tags                 24.2   19.9   26.3   18.6   38.1   51.7   42.9   48.2   15.6   37.2   18.3   43.0   38.2   52.2   40.0   33.8   36.0   21.6   33.4   38.3   35.6
CCNN w/ size                 36.7   23.6   47.1   30.2   40.6   59.5   54.3   51.9   15.9   43.3   34.8   48.2   42.5   59.2   43.1   35.5   45.2   31.4   46.2   42.2   43.3
CCNN w/ size (CRF tuned)     42.3   24.5   56.0   30.6   39.0   58.8   52.7   54.8   14.6   48.4   34.2   52.7   46.9   61.1   44.8   37.4   48.8   30.6   47.7   41.7   45.1

Table 2: Results on PASCAL VOC 2012 test. We compare our results to the fully supervised state-of-the-art methods.

We also directly compare our algorithm against the
EM-Adapt results as reported in Papandreou et al. [24] for weak
segmentation. However, their training procedure uses
random crops of both the original image and the segmentation
mask. The weak labels are then computed from those
random crops. This introduces limited information about the
spatial location of the weak tags. Taken to the extreme,
a 1 × 1 output crop reduces to full supervision. We thus
present this result in the next subsection on incorporating
increasing supervision.

7.3. Training with additional supervision 

We now consider slightly more supervision than just the
image-level tags. Firstly, we consider training with tags
on random crops of the original image, following Papandreou
et al. [24]. We evaluate our constrained optimization on
the EM-Adapt architecture using random crops, and
compare to the result obtained from their released caffemodel as
shown in Table 3. Using limited spatial information, our
algorithm slightly outperforms EM-Adapt, mainly due to the
more powerful background constraints. Note that the
difference is not as striking as in the pure weak label setting. We
believe this is due to the fact that the spatial information in
combination with the foreground prior emulates the upper
bound constraint on background, as a random crop is likely
to contain much fewer labels.

¹ https://bitbucket.org/deeplab/deeplab-public/overview


Method          Training Supervision    mIoU w/o CRF    mIoU
EM-Adapt [24]   Tags w/ random crops    34.3            36.0
CCNN            Tags w/ random crops    34.4            36.4
EM-Adapt [24]   Tags w/ object sizes    -               -
CCNN            Tags w/ object sizes    40.5            42.4

Table 3: Results using additional supervision during
training evaluated on the VOC 2012 validation set.

The main advantage of CCNN is that there is no restriction
on the type of linear constraints that can be used. To
demonstrate this further, we incorporate a simple size
constraint. For each label, we use one additional bit of
information: whether a certain class occupies more than 10%
of the image or not. This additional constraint is described
in Equation (10). As shown in Table 3, using this one
additional bit of information dramatically increases the
accuracy. Unfortunately, the EM-Adapt heuristic cannot directly
incorporate this more meaningful size constraint.
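As a rough illustration of the 1-bit size constraint, the sketch below checks it against a toy per-pixel label distribution. It is not the paper's implementation: the function names are hypothetical, and it assumes the constraint is placed on the expected pixel count of a class, with "large" classes bounded from below by 10% of the image and "small" classes bounded from above by the same value.

```python
def expected_area_fraction(probs, label):
    """Expected fraction of pixels assigned to `label`.

    `probs` holds one dict (label -> probability) per pixel.
    """
    return sum(p.get(label, 0.0) for p in probs) / len(probs)


def size_constraint_holds(probs, label, is_large, threshold=0.10):
    """Check the 1-bit size constraint for one label.

    Assumption: a "large" class must cover at least `threshold` of the
    image in expectation, and a "small" class at most `threshold`.
    """
    frac = expected_area_fraction(probs, label)
    return frac >= threshold if is_large else frac <= threshold


# Toy 4-pixel "image" with labels 0 (background) and 1 (foreground).
toy_probs = [
    {0: 0.9, 1: 0.1},
    {0: 0.8, 1: 0.2},
    {0: 0.3, 1: 0.7},
    {0: 0.4, 1: 0.6},
]
```

On the toy input, label 1 covers 40% of the pixels in expectation, so the check passes when the size bit says "large" and fails when it says "small". Both cases are linear in the per-pixel probabilities, which is what lets them enter the constraint set.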

Table 2 reports our results on the PASCAL VOC 2012 test
server and compares them to fully supervised approaches. To
better compare with these methods, we further add a result
where the CRF parameters are tuned on 100 validation
images. As a final experiment, we gradually add fully
supervised images in addition to our weak objective and
evaluate the model, i.e., semi-supervised learning. The graph
is shown in the supplementary material. Our model makes
good use of the additional supervision.

We also evaluate the sensitivity of our model to the
parameters of the constraints. We performed a line search along
each of the bounds while keeping the others fixed. In general,
our method is insensitive to a wide range of constraint
bounds due to the presence of slack variables. The standard
deviation in accuracy, averaged over all parameters, is
0.73%. Details are provided in the supplementary material.

Qualitative results are shown in Figure 4.












(a) Original image (b) Ground truth (c) Image tags (d) Image tags + size 

Figure 4: Qualitative results on the VOC 2012 dataset for different levels of supervision. We show the original image, ground
truth, our trained classifier with image-level tags and with size constraints. Note that the size constraints localize the objects
much better than just image-level tags, at the cost of missing small objects in a few examples.


7.4. Discussion 

We further experimented with bounding box constraints.
We constrain 75% of pixels within a bounding box to take
a specific label, while we suppress any labels outside the
bounding box. This additional supervision allows us to
boost the IoU accuracy to 54%. This number is competitive
with a baseline for which we train a model on all
pixels within a bounding box, which gives 52.3% [24].
However, it is not yet competitive with more sophisticated
systems that use more segmentation information within
bounding boxes [7,24]. Those systems perform at roughly
58.5-62.0% IoU accuracy. We believe the key to this
performance is a stronger use of the pixel-level segmentation
information.
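The bounding box constraints can be read as two linear bounds on the per-pixel probabilities of the box's label. The sketch below is one hedged interpretation rather than the authors' code: the function name is hypothetical, "75% of pixels" is taken as a lower bound on the expected pixel count inside the box, and "suppress" is taken as a zero upper bound outside it.

```python
def box_constraints_hold(probs, in_box, min_inside=0.75, tol=1e-6):
    """Check bounding-box constraints for a single label.

    probs  : per-pixel probability of the box's label (flat list)
    in_box : per-pixel booleans, True inside the bounding box
    Assumptions: the 75% rule is a lower bound on the expected pixel
    count inside the box; outside the box the expected count must be 0
    (up to the numerical tolerance `tol`).
    """
    inside = [p for p, b in zip(probs, in_box) if b]
    outside = [p for p, b in zip(probs, in_box) if not b]
    lower_ok = sum(inside) >= min_inside * len(inside)
    upper_ok = sum(outside) <= tol
    return lower_ok and upper_ok


# Five pixels, the first three of which lie inside the box.
in_box = [True, True, True, False, False]
good = [0.9, 0.8, 0.7, 0.0, 0.0]   # satisfies both bounds
leaky = [0.9, 0.8, 0.7, 0.2, 0.0]  # label mass outside the box
```

Both bounds are sums of per-pixel marginals, so they fit the same linear constraint form as the tag and size constraints.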


In conclusion, we presented CCNN, a constrained
optimization framework for optimizing convolutional
networks. The framework is general and can incorporate
arbitrary linear constraints. It naturally integrates into
standard Stochastic Gradient Descent, and can easily be used in
publicly available frameworks such as Caffe [17].

We showed that constraints are a natural way to describe
the desired output space of a labeling and can reduce the
amount of strong supervision CNNs require.
Supplementary Material:
Constrained Convolutional Neural Networks for 
Weakly Supervised Segmentation 


This document provides a detailed derivation of the results
used in the paper and additional quantitative experiments
to validate the robustness of the CCNN algorithm.

In the paper, we optimize constraints on the CNN output by
introducing a latent probability distribution $P(X)$. Recall the
overall main objective:

\[
\begin{aligned}
\underset{\theta, P}{\text{minimize}} \quad & -H_P \underbrace{-\, \mathbb{E}_{X \sim P}[\log Q(X|\theta)]}_{H_{P|Q}} \\
\text{subject to} \quad & A\vec{P} \ge \vec{b}, \quad \sum_X P(X) = 1,
\end{aligned}
\tag{11}
\]

where $H_P = -\sum_X P(X) \log P(X)$ is the entropy of the
latent distribution, $H_{P|Q}$ is the cross entropy and $\vec{P}$ is the
vectorized version of $P(X)$. In this supplementary material,
we show how to minimize the objective with respect to
$P$. For the complete block coordinate descent minimization
algorithm with respect to both $P$ and $\theta$, see the main paper.

Note that the objective function in (11) is the KL divergence
of the network output distribution $Q(X|\theta)$ from $P(X)$, which
is convex. Equation (11) is therefore a convex optimization
problem, since all constraints are linear. Furthermore, Slater's
condition holds as long as the constraints are satisfiable, and hence
we have strong duality [3]. First, we use this strong duality
to show that the minimum of (11) is a fully factorized
distribution. We then derive the dual function (Equation (4)
in the main paper), and finally extend the analysis to the
case where the objective is relaxed by adding slack variables.

I. Latent Distribution 

In this section, we show that the latent label distribution
$P(X)$ can be modeled as the product of independent
marginals without loss of generality. This is equivalent to
showing that the latent distribution that achieves the global
optimal value factorizes, while keeping the network parameters
$\theta$ fixed. First, we simplify the cross entropy term in
the objective function in (11) as follows:

\[
\begin{aligned}
H_{P|Q} &= -\mathbb{E}_{X \sim P}[\log Q(X|\theta)] \\
&= -\mathbb{E}_{X \sim P}\Big[\sum_{i=1}^{n} \log q_i(x_i|\theta)\Big] \\
&= -\sum_{i=1}^{n} \mathbb{E}_{X \sim P}[\log q_i(x_i|\theta)] \\
&= -\sum_{i=1}^{n} \sum_{l \in \mathcal{L}} P(x_i = l) \log q_i(l|\theta)
\end{aligned}
\tag{12}
\]

We used the linearity of expectation and the fact that $q_i$
is independent of any variable $x_j$ for $j \ne i$ to simplify
the objective, as shown above. Here, $P(x_i = l) =
\sum_{X : x_i = l} P(X)$ is the marginal distribution.

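Let us now look at the Lagrangian

\[
\begin{aligned}
\mathcal{L}(P, \lambda, \nu) &= -H_P + H_{P|Q} + \lambda^\top(\vec{b} - A\vec{P}) + \nu\Big(\sum_X P(X) - 1\Big) \\
&= -H_P + H_{P|Q} - \sum_{i,l} \lambda^\top A_{i;l} P(x_i = l) + \lambda^\top \vec{b} + \nu\Big(\sum_X P(X) - 1\Big) \\
&= -H_P \underbrace{- \sum_{i=1}^{n} \sum_{l \in \mathcal{L}} P(x_i = l)\big(\log q_i(l|\theta) + A_{i;l}^\top \lambda\big)}_{\tilde{H}_{P|Q}} + \lambda^\top \vec{b} + \nu\Big(\sum_X P(X) - 1\Big),
\end{aligned}
\tag{13}
\]

where $\tilde{H}_{P|Q}$ is a biased cross entropy term and $A_{i;l}$ is the
column of $A$ corresponding to $P(x_i = l)$. Here we use the fact
that the linear constraints are formulated on the marginals to
rephrase the dual objective. We will now show that this objective
can be rephrased as a KL-divergence between a biased
distribution $\tilde{Q}$ and $P$.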

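The biased cross entropy term can be rephrased
as a cross entropy between $P$ and a distribution
$\tilde{Q}(X|\theta,\lambda) = \prod_i \tilde{q}_i(x_i|\theta,\lambda)$, where
$\tilde{q}_i(x_i|\theta,\lambda) = \frac{1}{Z_i} q_i(x_i|\theta) \exp(A_{i;x_i}^\top \lambda)$
is the biased CNN distribution and
$Z_i$ is a local partition function ensuring $\tilde{q}_i$ sums to 1. This
partition function is defined as

\[
Z_i = \sum_{l} \exp\big(\log q_i(l|\theta) + A_{i;l}^\top \lambda\big).
\]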
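The cross entropy between $P$ and $\tilde{Q}$ is then defined as

\[
\begin{aligned}
H_{P|\tilde{Q}} &= -\sum_X P(X) \log \tilde{Q}(X|\theta,\lambda) \\
&= -\sum_i \sum_l P(x_i = l) \log \tilde{q}_i(l|\theta,\lambda) \\
&= -\sum_i \sum_l P(x_i = l)\big(\log q_i(l|\theta) + A_{i;l}^\top \lambda - \log Z_i\big) \\
&= \tilde{H}_{P|Q} + \sum_i \log Z_i
\end{aligned}
\tag{14}
\]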

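This allows us to rephrase (13) in terms of a KL-divergence
between $P$ and $\tilde{Q}$:

\[
\begin{aligned}
\mathcal{L}(P, \lambda, \nu) &= -H_P + H_{P|\tilde{Q}} - \sum_i \log Z_i + \lambda^\top \vec{b} + \nu\Big(\sum_X P(X) - 1\Big) \\
&= D(P \| \tilde{Q}) - C + \lambda^\top \vec{b} + \nu\Big(\sum_X P(X) - 1\Big),
\end{aligned}
\tag{15}
\]

where $C = \sum_i \log Z_i$ is a constant that depends on the local
partition functions of $\tilde{Q}$.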

The primal objective (11) can be phrased as
$\min_P \max_{\lambda \ge 0, \nu} \mathcal{L}(P, \lambda, \nu)$, which is equivalent to the
dual objective $\max_{\lambda \ge 0, \nu} \min_P \mathcal{L}(P, \lambda, \nu)$ due to strong
duality.

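Maximizing the dual objective can be phrased as maximizing
a dual function

\[
\begin{aligned}
\mathcal{L}(\lambda) &= \lambda^\top \vec{b} - C + \max_{\nu} \min_P \Big[ D(P \| \tilde{Q}) + \nu\Big(\sum_X P(X) - 1\Big) \Big] \\
&= \lambda^\top \vec{b} - C + \underbrace{\min_{P : \sum_X P(X) = 1} D(P \| \tilde{Q})}_{0} \\
&= \lambda^\top \vec{b} - \sum_i \log \sum_l \exp\big(\log q_i(l|\theta) + A_{i;l}^\top \lambda\big),
\end{aligned}
\tag{16}
\]

where the maximization over $\nu$ can be rephrased as a constraint
on $P$, i.e. $\sum_X P(X) = 1$. Maximizing (16) is equivalent
to minimizing the original constrained objective (11).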


Factorization The KL-divergence $D(P \| \tilde{Q})$ is minimized
at $P = \tilde{Q} = \prod_i \tilde{q}_i(x_i|\theta,\lambda)$, hence the minimizing $P$ fully
factorizes over all variables for any assignment to the dual
variables $\lambda$.
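The factorization result means the optimal latent distribution can be computed one pixel at a time, by reweighting the CNN output with the exponentiated dual bias and renormalizing. The sketch below illustrates this in plain Python. The function names and the nested-list layout are assumptions made for the example, and the CNN probabilities are assumed to be strictly positive so that the logarithm is defined.

```python
import math


def biased_marginal(q_i, bias_i):
    """Return the biased marginal, proportional to q_i(l) * exp(bias_i[l]).

    q_i    : CNN probabilities for one pixel (list over labels, all > 0)
    bias_i : the value A_{i;l}^T lambda for each label l at this pixel
    """
    scores = [math.log(q) + b for q, b in zip(q_i, bias_i)]
    m = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)                         # local partition function (rescaled)
    return [e / z for e in exps]


def optimal_latent(q, bias):
    """Per-pixel marginals of the minimizing P, one list per pixel."""
    return [biased_marginal(q_i, b_i) for q_i, b_i in zip(q, bias)]
```

With a zero bias the result equals the CNN output itself. A positive bias on a label shifts probability mass toward that label at that pixel only, which is why no coupling between pixels needs to be stored.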


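Dual function Using the definition $q_i(l|\theta) =
\frac{1}{Z_i} \exp(f_i(l;\theta))$ we can define the dual function with
respect to $f_i$:

\[
\mathcal{L}(\lambda) = \lambda^\top \vec{b} - \sum_i \log \sum_l \exp\big(f_i(l;\theta) + A_{i;l}^\top \lambda\big) + \underbrace{\sum_i \log Z_i}_{\text{const.}},
\]

where $Z_i = \sum_l \exp(f_i(l;\theta))$ here denotes the partition function
of the network output. Its logarithm is constant with respect to $\lambda$
and falls out in the optimization.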
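To make the dual function concrete, the following sketch evaluates it, without the constant term, for raw network scores, together with its gradient with respect to the dual variables. Differentiating the log-sum-exp gives a gradient of b - A p, where p is the softmax of the biased scores. The data layout is an assumption for illustration: `f[i][l]` holds the score of label l at pixel i, and each row of `A` is a flat list indexed by `i * n_labels + l`.

```python
import math


def bias(A, lam, n_pixels, n_labels):
    """bias[i][l] = A_{i;l}^T lambda, with column index i * n_labels + l."""
    return [[sum(lam[k] * A[k][i * n_labels + l] for k in range(len(lam)))
             for l in range(n_labels)] for i in range(n_pixels)]


def dual_value(f, A, b, lam):
    """lambda^T b - sum_i log sum_l exp(f_i(l) + A_{i;l}^T lambda)."""
    n_pixels, n_labels = len(f), len(f[0])
    s = bias(A, lam, n_pixels, n_labels)
    total = sum(lk * bk for lk, bk in zip(lam, b))
    for i in range(n_pixels):
        scores = [f[i][l] + s[i][l] for l in range(n_labels)]
        m = max(scores)                   # stable log-sum-exp
        total -= m + math.log(sum(math.exp(x - m) for x in scores))
    return total


def dual_gradient(f, A, b, lam):
    """Gradient b - A p, where p is the softmax of the biased scores."""
    n_pixels, n_labels = len(f), len(f[0])
    s = bias(A, lam, n_pixels, n_labels)
    p = []
    for i in range(n_pixels):
        scores = [f[i][l] + s[i][l] for l in range(n_labels)]
        m = max(scores)
        e = [math.exp(x - m) for x in scores]
        z = sum(e)
        p.extend(x / z for x in e)
    return [bk - sum(a * pj for a, pj in zip(A[k], p))
            for k, bk in enumerate(b)]
```

A positive gradient entry means the corresponding constraint is still violated by the current biased distribution, so increasing that dual variable pushes more probability mass toward satisfying it.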

II. Optimizing Constraints with Slack Variable 

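The slack relaxed loss function is given by

\[
\begin{aligned}
\underset{\theta, P, \xi}{\text{minimize}} \quad & -H_P \underbrace{-\, \mathbb{E}_{X \sim P}[\log Q(X|\theta)]}_{H_{P|Q}} + \beta^\top \xi \\
\text{subject to} \quad & A\vec{P} \ge \vec{b} - \xi, \quad \sum_X P(X) = 1, \quad \xi \ge 0.
\end{aligned}
\tag{17}
\]

For any $\beta \le 0$, a value of $\xi \to \infty$ will minimize the
objective and hence invalidate the corresponding constraints.
Thus, for the remainder of the section we assume that
$\beta > 0$. The Lagrangian dual to this loss is defined as

\[
\mathcal{L}(P, \lambda, \nu, \gamma) = -H_P + H_{P|Q} + \beta^\top \xi + \lambda^\top(\vec{b} - A\vec{P} - \xi) + \nu\Big(\sum_X P(X) - 1\Big) - \gamma^\top \xi.
\tag{18}
\]

We know that the dual variable $\gamma \ge 0$ is non-negative,
and that

\[
\frac{\partial}{\partial \xi} \mathcal{L}(P, \lambda, \nu, \gamma) = \beta - \lambda - \gamma = 0.
\tag{19}
\]

This leads to the following constraint on $\lambda$:

\[
0 \le \gamma = \beta - \lambda.
\]

Hence the slack weight forms an upper bound $\lambda \le \beta$.
Substituting (19) into (18) reduces the dual objective to the
non-slack objective in (13), and the rest of the derivation is
equivalent.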
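One practical consequence of the bound on the dual variables is that the slack weight turns the feasible set for them into a box between zero and the slack weight. The sketch below shows projected gradient ascent with that clipping on a toy problem with one pixel, two labels and a single constraint that label 1 has probability at least 0.9. The step size, iteration count, toy scores and function names are all hypothetical choices for illustration.

```python
import math


def project(lam, beta):
    """Clip each dual variable to the box 0 <= lambda_k <= beta_k."""
    return [min(max(x, 0.0), bk) for x, bk in zip(lam, beta)]


def projected_ascent(grad_fn, lam, beta, step=0.5, iters=200):
    """Gradient ascent on the dual, projecting onto [0, beta] each step."""
    for _ in range(iters):
        g = grad_fn(lam)
        lam = project([x + step * gk for x, gk in zip(lam, g)], beta)
    return lam


def toy_gradient(lam, f=(0.0, -2.0), b=0.9):
    """Dual gradient b - p(1) for one pixel with the constraint p(1) >= b."""
    # Probability of label 1 after adding the bias lam[0] to its score.
    p1 = 1.0 / (1.0 + math.exp(f[0] - f[1] - lam[0]))
    return [b - p1]
```

With a loose slack weight such as `beta=[10.0]` the ascent settles near 2 + log 9, where the constraint holds with equality. With `beta=[1.0]` the dual variable stops at 1.0 and the constraint stays partly violated, which is the intended effect of the slack.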

III. Ablation Study for Parameter Selection 

In this section, we present results to analyze the sensitivity
of our approach with respect to the constraint parameters,
i.e. the upper and lower bounds. We performed a line search
along each of the bounds while keeping the others fixed. The
method is quite robust, with a standard deviation of 0.73%
in accuracy, averaged over all parameters, as shown in
Table 4. These experiments are performed in the setting where
image-level tags and 1-bit size supervision are available during
training, as discussed in the paper. We attribute this
robustness to the slack variables that are learned per constraint
per image.





Fgnd lower    Bgnd lower    Bgnd upper    mIoU
a_l           a_0           b_0           w/o CRF
0.1           0.2           0.7           40.5
0.2           0.2           0.7           40.6
0.3           0.2           0.7           40.6
0.4           0.2           0.7           39.6
0.1           0.1           0.7           40.5
0.1           0.3           0.7           40.4
0.1           0.4           0.7           40.5
0.1           0.5           0.7           40.4
0.1           0.2           0.5           36.6
0.1           0.2           0.6           38.9
0.1           0.2           0.8           39.7


Table 4: Ablation study for sensitivity analysis of the
CCNN optimization with respect to the chosen parameters.
The parameters mentioned here are defined in Equations (8)
and (9) in the main paper. Parameter values used in all other
experiments are shown in bold.
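The averaged standard deviation can be approximated directly from Table 4. The sketch below groups the mIoU values by the bound being varied and averages the per-bound sample standard deviations. The grouping, the use of the sample rather than the population standard deviation, and the choice to count the shared (0.1, 0.2, 0.7) row in every sweep are assumptions, since the text does not spell out the exact procedure.

```python
import statistics

# mIoU (w/o CRF) from Table 4, grouped by the bound that is varied.
# Each sweep includes the shared (0.1, 0.2, 0.7) row scoring 40.5.
sweeps = {
    "fgnd_lower": [40.5, 40.6, 40.6, 39.6],        # a_l = 0.1 .. 0.4
    "bgnd_lower": [40.5, 40.5, 40.4, 40.5, 40.4],  # a_0 = 0.1 .. 0.5
    "bgnd_upper": [36.6, 38.9, 40.5, 39.7],        # b_0 = 0.5 .. 0.8
}

# Sample standard deviation of each sweep, then their plain average.
per_bound_std = {name: statistics.stdev(vals) for name, vals in sweeps.items()}
mean_std = sum(per_bound_std.values()) / len(per_bound_std)
```

Under these assumptions `mean_std` comes out at about 0.74, close to the reported 0.73%. Most of the variation comes from the background upper bound, while the background lower bound has almost no effect.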

IV. Ablation Study in Semi-Supervised Setting 

In this section, we experiment by incorporating fully
supervised images in addition to our weak objective. The
accuracy curve is depicted in Figure 5.



[Figure 5 plot. Horizontal axis: Number of Fully Annotated Images, from Weakly Supervised through 40% Supervision to Fully Supervised.]

Figure 5: Ablation study with a varying number of fully
supervised images. Our model makes good use of the
additional supervision.